# always clean up R environment
rm(list = ls())
# load all packages here
# Basic Data Analysis & Wrangling
library(tidyverse)
library(lubridate)
# Library for segmenting Chinese text into words. 
library(jiebaRD)
library(jiebaR)
# Library for generating word cloud. 
library(wordcloud)
# Library for visualizing the 3d plot. 
library(plotly)

I - Introduction

Investopedia defines social good as “something that benefits the largest number of people in the largest possible way, such as clean air, clean water, healthcare and literacy.” Social good is also referred to as the “common good.” In this project, we discuss a heated issue within the developer community. Before going into depth, here is some background on why the issue matters:

Overall Chinese Internet environment:

  • Several Western companies have tried to enter the lucrative mainland China market, and many of them ran into obstacles. A famous example is Google, whose Dragonfly project was designed specifically for the Chinese market and failed because of regulation and a lack of interest from the general public. eBay was one of the earliest of these companies to set its sights on the Asian market. In our view, these big companies failed for the same reason: they did not adapt to a very different culture. We personally call this an “attempt at cultural imperialism within the field of the Internet.” The big companies had expanded into many other markets without changing much of their business model or their approach to the local audience. However, there is a tremendous cultural difference between Western and Eastern societies. For example, the younger generation in the United States uses Facebook, Twitter, Instagram, Snapchat and other social platforms simultaneously, whereas teens in China mostly rely on a single app, WeChat, which can also pay bills, hail a ride, book a movie ticket or find a restaurant. Differences in consumer behavior and culture are what held these big companies back.

  • The Chinese government is known for its regulation and censorship. According to an article published by The New York Times, the president hopes to use the Internet to strengthen the Communist Party’s role in society. The majority of the young generation is indifferent to politics, although many are affected by censorship, and some work in “censorship factories” themselves.

Having briefly introduced how the overall Chinese Internet environment differs from that of the United States and other developed countries, we now turn to the topic of interest: the 996.ICU movement.

This movement is personally significant to our team, because both of us have interacted with the companies mentioned above and have friends and family who work in tech in China. We have witnessed the consequences of long hours and unproductive work in the Chinese tech industry. On the one hand, the overall Chinese Internet environment is different from that of the United States and, in our view, at least 10 years behind it. On the other hand, pressuring workers to work long hours would not bridge that gap, nor would it benefit technological progress.

For the rest of our project, we use text analysis together with supervised and unsupervised learning techniques to explore the problem.

II - Data Preprocessing

Load datasets

dt_issues = read.csv("data/issues_data.csv", header=TRUE)
dt_star = read.csv("data/stargazers.csv", header=TRUE)
dt_user = read.csv("data/users_data.csv", header=TRUE)

Sanity Check

# Inspect the dataset by taking the first 10 rows of each dataset. 
dt_issues %>% head(10)
dt_star %>% head(10)
dt_user %>% head(10)

Data Cleaning & Wrangling

# User dataset cleaning
# Cleaning
dt_user_cld <- dt_user %>% 
    # Join the issues dataset. 
    left_join(dt_issues, by="X_id") %>%
    select(bio, blog, company, created_at.x, followers, following, hireable, location, login, name, public_gists, 
           type, closed_at, updated_at.x, email, organizations_url, public_repos) %>%
    # Drop unused features. 
    # select(-X_id, -avatar_url, -events_url, -followers_url, -following_url, 
    #        -gists_url, -gravatar_id, -html_url, -node_id, -public_gists, -received_events_url, 
    #        -repos_url, -site_admin, -starred_url, -subscriptions_url, -type) %>%
    # Parse the ISO 8601 timestamps into POSIXct date-times. 
    mutate(created_at = lubridate::ymd_hms(created_at.x), 
           updated_at = lubridate::ymd_hms(updated_at.x)) %>%
    # Convert various factor type features to string type. 
    mutate(bio = as.character(bio), 
           blog = as.character(blog), 
           company = as.character(company), 
           email = as.character(email), 
           location = as.character(location), 
           login = as.character(login), 
           name = as.character(name), 
           organizations_url = as.character(organizations_url))
Column `X_id` joining factors with different levels, coercing to character vector
str(dt_user_cld)
'data.frame':   39987 obs. of  19 variables:
 $ bio              : chr  "" "" "" "" ...
 $ blog             : chr  "" "" "" "" ...
 $ company          : chr  "" "" "" "" ...
 $ created_at.x     : Factor w/ 39971 levels "2008-03-26T03:33:42Z",..: 23111 24348 13701 27767 34585 17517 19613 22254 23043 18383 ...
 $ followers        : int  9 4 7 2 0 2 13 5 1 0 ...
 $ following        : int  16 38 14 89 0 0 21 126 48 1 ...
 $ hireable         : Factor w/ 2 levels "","True": 1 1 1 1 1 1 1 1 1 1 ...
 $ location         : chr  "" "" "" "" ...
 $ login            : chr  "moloach" "bhxch" "YueNing" "BigFaceCatMhc" ...
 $ name             : chr  "" "Zhe Lee" "naodongbanana" "" ...
 $ public_gists     : int  0 0 0 0 0 0 1 24 0 0 ...
 $ type             : Factor w/ 1 level "User": 1 1 1 1 1 1 1 1 1 1 ...
 $ closed_at        : logi  NA NA NA NA NA NA ...
 $ updated_at.x     : Factor w/ 38705 levels "2015-10-22T09:47:56Z",..: 14522 14681 25825 31448 31450 37456 23803 31101 4862 19705 ...
 $ email            : chr  "" "mytempbh@outlook.com" "n1085633848@outlook.com" "" ...
 $ organizations_url: chr  "https://api.github.com/users/moloach/orgs" "https://api.github.com/users/bhxch/orgs" "https://api.github.com/users/YueNing/orgs" "https://api.github.com/users/BigFaceCatMhc/orgs" ...
 $ public_repos     : int  10 34 29 4 0 26 93 102 13 2 ...
 $ created_at       : POSIXct, format: "2016-07-12 05:17:50" "2016-08-27 14:04:23" "2015-06-19 13:57:11" ...
 $ updated_at       : POSIXct, format: "2019-03-09 00:15:05" "2019-03-09 09:30:45" "2019-03-25 10:50:50" ...
# Issues dataset cleaning
dt_issues_cld <- dt_issues %>%
    # Keep only the features used in the analysis. 
    select(created_at, body, comments, title, user.login) %>%
    # Parse the ISO 8601 timestamps into POSIXct date-times. 
    mutate(created_at = lubridate::ymd_hms(created_at), 
           # Convert the factor type features to the correct format. 
           body = as.character(body), 
           title = as.character(title), 
           user.login = as.character(user.login))

III - Data Analysis

1. Supporters’ Profile Analysis

1. What Companies/Universities are those programmers from?

company_info <- dt_user_cld %>% 
    group_by(company) %>%
    summarise(count = n()) %>%
    arrange(desc(count)) %>%
    filter(company != "")
# Display the top companies. 
company_info
# Define the company aggregation function. 
company_aggregation <- function(name) {
    # Make case insensitive. 
    orig_name <- name
    name <- toupper(name)
    # Detect pattern and change the company name accordingly. 
    if (grepl("百度|BAIDU|AIDU", name)) {
        target_name <- "Baidu"
    } else if (grepl("ENCENT|腾讯|TENCENT", name)) {
        target_name <- "Tencent"
    } else if (grepl("LIBABA|淘宝|AOBAO|LIPAY|阿里巴巴|LIYUN|阿里云", name)) {
        target_name <- "Alibaba"
    } else if (grepl("JD|京东", name)) {
        target_name <- "JD"
    } else if (grepl("ETEASE|网易", name)) {
        target_name <- "NetEase"
    } else if (grepl("EITUAN|美团", name)) {
        target_name <- "MeiTuan"
    } else if (grepl("YTEDANCE|字节|头条", name)) {
        target_name <- "ByteDance"
    } else if (grepl("ELEME|饿了", name)) {
        target_name <- "Eleme"
    } else if (grepl("UAWEI|华为", name)) {
        target_name <- "Huawei"
    } else if (grepl("DIDI|滴滴|嘀嘀", name)) {
        target_name <- "DiDi"
    } else {
        target_name <- orig_name
    }
    
    return (target_name)
}
# Define the education aggregation function. 
education_aggregation <- function(name) {
    # Make case insensitive
    orig_name <- name
    name <- toupper(name)
    # Detect pattern and change the education accordingly. 
    if (grepl("HEJIANG|ZJU|浙江大学|浙大", name)) {
        target_name <- "Zhejiang University"
    } else if (grepl("SINGHUA|清华", name)) {
        target_name <- "Tsinghua University"
    } else if (grepl("SHANGHAI JIAO TONG|SJTU|上海交大|上海交通", name)) {
        target_name <- "Shanghai Jiao Tong University"
    } else if (grepl("UESTC|电子科大|电子科技", name)) {
        target_name <- "University of Electronic Science and Technology of China"
    } else if (grepl("USTC|中科大|中国科学技术", name)) {
        target_name <- "University of Science and Technology of China"
    } else if (grepl("FUDAN|复旦", name)) {
        target_name <- "Fudan University"
    } else if (grepl("ARBIN|哈工大|哈尔滨工业", name)) {
        target_name <- "Harbin Institute of Technology"
    } else if (grepl("BUPT|北邮|北京邮电", name)) {
        target_name <- "Beijing University of Post and Telecommunications"
    } else {
        target_name <- NA
    }
    
    return (target_name)
}
# Aggregate the disparate spellings of company and university names. 
agg_companies <- rep(NA, nrow(company_info))
agg_education <- rep(NA, nrow(company_info))
for (i in 1:nrow(company_info)) {
    agg_companies[i] <- company_aggregation(company_info$company[i])
    agg_education[i] <- education_aggregation(company_info$company[i])
}
company_info_agg <- cbind(company_info, agg_companies, agg_education)
# Show the top ten companies with the most developers supporting 996.icu. 
# Sum the per-spelling developer counts; n() would only count distinct spellings. 
company_info_agg %>% group_by(agg_companies) %>%
    summarise(count = sum(count)) %>%
    arrange(desc(count)) %>% 
    head(10)
# Show which universities those developers come from. 
company_info_agg %>% group_by(agg_education) %>%
    summarise(count = sum(count)) %>%
    arrange(desc(count)) %>%
    filter(!is.na(agg_education)) %>%
    head(10)
Factor `agg_education` contains implicit NA, consider using `forcats::fct_explicit_na`
NA

2. What cities are those developers from?

# Define the function for aggregating the cities. 
city_aggregation <- function(name) {
    # Make case insensitive. 
    orig_name <- name
    name <- toupper(name)
    # Detect the pattern and map the location to a canonical city name. 
    # Guangzhou is tested before Hangzhou because "GUANGZHOU" also contains "ANGZHOU". 
    if (grepl("EIJING|北京", name)) {
        target_name <- "Beijing"
    } else if (grepl("HANGHAI|上海", name)) {
        target_name <- "Shanghai"
    } else if (grepl("UANGZHOU|广州", name)) {
        target_name <- "Guangzhou"
    } else if (grepl("ANGZHOU|杭州", name)) {
        target_name <- "Hangzhou"
    } else if (grepl("HENGDU|成都", name)) {
        target_name <- "Chengdu"
    } else if (grepl("ANJING|南京", name)) {
        target_name <- "Nanjing"
    } else if (grepl("INGAPORE|新加坡", name)) {
        target_name <- "Singapore"
    } else if (grepl("HONG KONG|香港|HK", name)) {
        target_name <- "Hong Kong"
    } else if (grepl("UHAN|武汉", name)) {
        target_name <- "Wuhan"
    } else {
        target_name <- orig_name
    }
    
    return (target_name)
}
city_info <- dt_user_cld %>% 
    group_by(location) %>%
    summarise(count = n()) %>%
    filter(location != "", 
           location != "China") %>%
    arrange(desc(count))
agg_cities <- rep(NA, nrow(city_info))
for (i in 1:nrow(city_info)) {
    agg_cities[i] <- city_aggregation(city_info$location[i])
} 
city_info_agg <- cbind(city_info, agg_cities)
# Show the top ten cities with the most developers supporting 996.icu. 
# Sum the per-spelling developer counts; n() would only count distinct spellings. 
city_info_agg %>% group_by(agg_cities) %>%
    summarise(count = sum(count)) %>%
    filter(agg_cities != "", 
           agg_cities != "China") %>%
    arrange(desc(count)) %>%
    head(10)

4. Distribution Plot of Supporters’ Information.

# Distribution of supporters' followers, following and public repositories (values up to 50 only). 
dist_ggplot <- dt_user_cld %>% filter(followers <= 50, following <= 50, public_repos <= 50) %>%
    ggplot() +
    geom_bar(aes(x = followers), col="black", fill="black", alpha=0.5) +
    geom_bar(aes(x = following), col="black", fill="red", alpha=0.5) +
    geom_bar(aes(x = public_repos), col="black", fill="blue", alpha=0.5)
    
dist_ggplot + 
    labs(x = "Followers (Black), Following (Red) and Public Repositories (Blue)", 
         y = "Count") +
    ggtitle("Distribution Plot")

5. Distribution Plot of Supporters’ Registration Duration.

# Compute the number of days since each supporter registered their GitHub account. 
today <- lubridate::ymd("2019-04-29")
dt_user_cld <- dt_user_cld %>%
    # Calculate the duration and convert it to numerical value. 
    mutate(duration = as.numeric(as.duration(interval(created_at, today)), "days"))
# Showing average registration years. 
print(mean(dt_user_cld$duration)/365)
[1] 3.321946
# Distribution plot of registration days. 
dt_user_cld %>% ggplot(aes(x = duration/365)) + 
    geom_histogram(col="black", fill="grey", alpha = 0.7) +
    geom_vline(xintercept = mean(dt_user_cld$duration)/365, linetype = "dotted", color = "red", size = 1.5) +
    labs(y = "Frequency / Count", 
         x = "Number of Years Since Registration") +
    ggtitle("Distribution Plot of Supporters' Registration Duration")

2. Statistical Modeling

1. Analyzing the Relationship Between Followers and other factors.

# Select variables for analysis. 
user_stat <- dt_user_cld %>% 
    select(followers, following, public_repos, duration)
# Sanity Check
user_stat %>% head(10)
# Unsupervised Learning: PCA
user_pca <- prcomp(user_stat, center=TRUE, scale.=TRUE)
print(user_pca)
Standard deviations (1, .., p=4):
[1] 1.1689728 0.9810532 0.9575395 0.8684211

Rotation (n x k) = (4 x 4):
                   PC1        PC2         PC3        PC4
followers    0.3419542 -0.8727076  0.30951456  0.1601545
following    0.5162926  0.3258117  0.60978132 -0.5054259
public_repos 0.6008709  0.3352248 -0.09115495  0.7199092
duration     0.5054339 -0.1408987 -0.72391868 -0.4479128
summary(user_pca)
Importance of components:
                          PC1    PC2    PC3    PC4
Standard deviation     1.1690 0.9811 0.9575 0.8684
Proportion of Variance 0.3416 0.2406 0.2292 0.1885
Cumulative Proportion  0.3416 0.5822 0.8115 1.0000
# Supervised Learning: regression
# y = dt_user_cld$followers
# x = dt_user_cld$following, public_repos, duration
lm <- lm(followers ~ following+public_repos+duration, data = dt_user_cld)
summary(lm)

Call:
lm(formula = followers ~ following + public_repos + duration, 
    data = dt_user_cld)

Residuals:
    Min      1Q  Median      3Q     Max 
-1102.5   -12.8    -5.7     1.1 13756.7 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.006e+01  1.276e+00  -7.884 3.24e-15 ***
following     5.212e-02  3.965e-03  13.144  < 2e-16 ***
public_repos  6.555e-02  1.088e-02   6.023 1.73e-09 ***
duration      1.560e-02  9.397e-04  16.605  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 122.6 on 39983 degrees of freedom
Multiple R-squared:  0.01515,   Adjusted R-squared:  0.01508 
F-statistic: 205.1 on 3 and 39983 DF,  p-value: < 2.2e-16
# Because the adjusted R-squared is very low (0.01508), we check the linear-model assumptions with diagnostic plots.
plot(lm)

# The diagnostic plots show that the data violate several linear-regression assumptions, so a linear model is not suitable for this dataset.

3. Main Questions/Issues of Supporters

2. Text Analysis About Topics of 996 Movement.

# Setting the word-split engine
splitter <- worker(stop_word = "data/stopwords.txt")
# Splitting the words. 
seg <- c(splitter[dt_issues_cld$title], splitter[dt_issues_cld$body])
seg <- seg[nchar(seg) > 1]
# Mark the Chinese word vector as UTF-8 encoded. 
Encoding(seg) <- "UTF-8"
# Extract the top 100 
seg_df <- data.frame(seg = seg) %>%
    group_by(seg) %>%
    summarise(freq = n()) %>%
    arrange(desc(freq)) %>%
    head(100)
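# Generating word cloud (Chinese Version). 
# Switch to a font with Chinese glyphs, then restore the previous setting. 
font_family <- par("family")
par(family = "Adobe Heiti Std R")
wordcloud(words=seg_df$seg, freq=seg_df$freq, 
          colors=brewer.pal(8,"Dark2"), 
          scale=c(4, 0.8))
par(family = font_family)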

# Loading translated dataset. 
trans <- read.table("data/translate", sep="\t")[-1,]
seg_df <- cbind(seg_df, engl = trans$V2)
# Generating word cloud (English Version)
wordcloud(words=seg_df$engl, freq=seg_df$freq, 
          colors=brewer.pal(8,"Dark2"), 
          scale=c(4, 0.8))

IV - References

Kenton, Will. “Social Good.” Investopedia, Investopedia, 12 Mar. 2019, www.investopedia.com/terms/s/social_good.asp.

Wei, Shiyang. “A pilot study on the Chinese internet environment.” International Conference on Advances in Education and Management. Springer, Berlin, Heidelberg, 2011.

Yuan, Li. “Learning China’s Forbidden History, So They Can Censor It.” The New York Times, The New York Times, 2 Jan. 2019, www.nytimes.com/2019/01/02/business/china-internet-censor.html.

“Chinese Developers Use Github to Protest against Country’s 996 Work Schedule.” South China Morning Post, 29 Mar. 2019, www.scmp.com/tech/start-ups/article/3003691/developers-lives-matter-chinese-software-engineers-use-github.

---
title: "996 Analysis"
author: "Zheng Zhang, "
output: html_notebook
---

```{r}
# always clean up R environment
rm(list = ls())

# load all packages here
# Basic Data Analysis & Wrangling
library(tidyverse)
library(lubridate)

# Library for segmenting Chinese text into words. 
library(jiebaRD)
library(jiebaR)

# Library for generating word cloud. 
library(wordcloud)

# Library for visualizing the 3d plot. 
library(plotly)
```


# I - Introduction
### Investopedia defines social good as "something that benefits the largest number of people in the largest possible way, such as clean air, clean water, healthcare and literacy." Social good is also referred to as the "common good." In this project, we discuss a heated issue within the developer community. Before going into depth, here is some background on why the issue matters:

### Overall Chinese Internet environment:

* Several Western companies have tried to enter the lucrative mainland China market, and many of them ran into obstacles. A famous example is Google, whose Dragonfly project was designed specifically for the Chinese market and failed because of regulation and a lack of interest from the general public. eBay was one of the earliest of these companies to set its sights on the Asian market. In our view, these big companies failed for the same reason: they did not adapt to a very different culture. We personally call this an "attempt at cultural imperialism within the field of the Internet." The big companies had expanded into many other markets without changing much of their business model or their approach to the local audience. However, there is a tremendous cultural difference between Western and Eastern societies. For example, the younger generation in the United States uses Facebook, Twitter, Instagram, Snapchat and other social platforms simultaneously, whereas teens in China mostly rely on a single app, WeChat, which can also pay bills, hail a ride, book a movie ticket or find a restaurant. Differences in consumer behavior and culture are what held these big companies back. 

* The Chinese government is known for its regulation and censorship. According to an article published by The New York Times, the president hopes to use the Internet to strengthen the Communist Party's role in society. The majority of the young generation is indifferent to politics, although many are affected by censorship, and some work in "censorship factories" themselves. 

### Having briefly introduced how the overall Chinese Internet environment differs from that of the United States and other developed countries, we now turn to the topic of interest: the 996.ICU movement. 

### 996.ICU refers to the grueling and illegal working hours at many tech companies in China: from 9 a.m. to 9 p.m., 6 days a week. The name "996.ICU" comes from the description in the repository: "By following the '996' work schedule, you are risking yourself getting into the ICU (Intensive Care Unit)." The controversy peaked in mid-April 2019, when Jack Ma, the founder of the e-commerce giant Alibaba Group, remarked: "It is a huge blessing that we can work 996." Alibaba operates what is often called the Amazon of China, as well as the biggest cloud computing platform in the mainland. He said, "If you do not do 996 when you are young, when will you do it. If you don't put more time and energy than others, how can you achieve the success you want?" These remarks drew controversy inside and outside the country. At the time of writing, the 996.ICU repository ranks No. 2 on the Trending page of GitHub, the world's largest developer community, right after a repository that hosts algorithms implemented in Python. Workers at Microsoft and GitHub started their own repository to support the 996.ICU movement. 

### This movement is personally significant to our team, because both of us have interacted with the companies mentioned above and have friends and family who work in tech in China. We have witnessed the consequences of long hours and unproductive work in the Chinese tech industry. On the one hand, the overall Chinese Internet environment is different from that of the United States and, in our view, at least 10 years behind it. On the other hand, pressuring workers to work long hours would not bridge that gap, nor would it benefit technological progress.

### For the rest of our project, we use text analysis together with supervised and unsupervised learning techniques to explore the problem.


# II - Data Preprocessing

### Load datasets

```{r}
dt_issues = read.csv("data/issues_data.csv", header=TRUE)
dt_star = read.csv("data/stargazers.csv", header=TRUE)
dt_user = read.csv("data/users_data.csv", header=TRUE)
```

### Sanity Check

```{r}
# Inspect the dataset by taking the first 10 rows of each dataset. 
dt_issues %>% head(10)
dt_star %>% head(10)
dt_user %>% head(10)
```

### Data Cleaning & Wrangling

```{r}
# User dataset cleaning
# Cleaning
dt_user_cld <- dt_user %>% 
    # Join the issues dataset. 
    left_join(dt_issues, by="X_id") %>%
    select(bio, blog, company, created_at.x, followers, following, hireable, location, login, name, public_gists, 
           type, closed_at, updated_at.x, email, organizations_url, public_repos) %>%
    # Drop unused features. 
    # select(-X_id, -avatar_url, -events_url, -followers_url, -following_url, 
    #        -gists_url, -gravatar_id, -html_url, -node_id, -public_gists, -received_events_url, 
    #        -repos_url, -site_admin, -starred_url, -subscriptions_url, -type) %>%
    # Parse the ISO 8601 timestamps into POSIXct date-times. 
    mutate(created_at = lubridate::ymd_hms(created_at.x), 
           updated_at = lubridate::ymd_hms(updated_at.x)) %>%
    # Convert various factor type features to string type. 
    mutate(bio = as.character(bio), 
           blog = as.character(blog), 
           company = as.character(company), 
           email = as.character(email), 
           location = as.character(location), 
           login = as.character(login), 
           name = as.character(name), 
           organizations_url = as.character(organizations_url))

str(dt_user_cld)

# Issues dataset cleaning
dt_issues_cld <- dt_issues %>%
    # Keep only the features used in the analysis. 
    select(created_at, body, comments, title, user.login) %>%
    # Parse the ISO 8601 timestamps into POSIXct date-times. 
    mutate(created_at = lubridate::ymd_hms(created_at), 
           # Convert the factor type features to the correct format. 
           body = as.character(body), 
           title = as.character(title), 
           user.login = as.character(user.login))
```
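
The `as.character()` conversions above are needed because older versions of R read strings as factors by default. As a minimal sketch, the same result can be obtained at load time with `stringsAsFactors = FALSE`, and the ISO 8601 timestamps can be parsed in base R. The tiny inline CSV and the logins in it are invented for illustration and are not the project data:

```r
# Hypothetical two-row sample that mimics the layout of users_data.csv.
csv_text <- "login,company,created_at
alice,Example Corp,2016-07-12T05:17:50Z
bob,,2015-06-19T13:57:11Z"

# Read strings as character vectors rather than factors.
users <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Parse the ISO 8601 timestamps as UTC date-times.
users$created_at <- as.POSIXct(users$created_at,
                               format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

str(users)
```

Reading the columns as characters up front would make most of the `mutate(... = as.character(...))` calls unnecessary, at the cost of diverging from the factor-based output shown by `str(dt_user_cld)`.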

# III - Data Analysis

## 1. Supporters' Profile Analysis

### 1. What Companies/Universities are those programmers from?

```{r}
company_info <- dt_user_cld %>% 
    group_by(company) %>%
    summarise(count = n()) %>%
    arrange(desc(count)) %>%
    filter(company != "")

# Display the top companies. 
company_info
```


```{r}
# Define the company aggregation function. 
company_aggregation <- function(name) {
    # Make case insensitive. 
    orig_name <- name
    name <- toupper(name)
    # Detect pattern and change the company name accordingly. 
    if (grepl("百度|BAIDU|AIDU", name)) {
        target_name <- "Baidu"
    } else if (grepl("ENCENT|腾讯|TENCENT", name)) {
        target_name <- "Tencent"
    } else if (grepl("LIBABA|淘宝|AOBAO|LIPAY|阿里巴巴|LIYUN|阿里云", name)) {
        target_name <- "Alibaba"
    } else if (grepl("JD|京东", name)) {
        target_name <- "JD"
    } else if (grepl("ETEASE|网易", name)) {
        target_name <- "NetEase"
    } else if (grepl("EITUAN|美团", name)) {
        target_name <- "MeiTuan"
    } else if (grepl("YTEDANCE|字节|头条", name)) {
        target_name <- "ByteDance"
    } else if (grepl("ELEME|饿了", name)) {
        target_name <- "Eleme"
    } else if (grepl("UAWEI|华为", name)) {
        target_name <- "Huawei"
    } else if (grepl("DIDI|滴滴|嘀嘀", name)) {
        target_name <- "DiDi"
    } else {
        target_name <- orig_name
    }
    
    return (target_name)
}

# Define the education aggregation function. 
education_aggregation <- function(name) {
    # Make case insensitive
    orig_name <- name
    name <- toupper(name)
    # Detect pattern and change the education accordingly. 
    if (grepl("HEJIANG|ZJU|浙江大学|浙大", name)) {
        target_name <- "Zhejiang University"
    } else if (grepl("SINGHUA|清华", name)) {
        target_name <- "Tsinghua University"
    } else if (grepl("SHANGHAI JIAO TONG|SJTU|上海交大|上海交通", name)) {
        target_name <- "Shanghai Jiao Tong University"
    } else if (grepl("UESTC|电子科大|电子科技", name)) {
        target_name <- "University of Electronic Science and Technology of China"
    } else if (grepl("USTC|中科大|中国科学技术", name)) {
        target_name <- "University of Science and Technology of China"
    } else if (grepl("FUDAN|复旦", name)) {
        target_name <- "Fudan University"
    } else if (grepl("ARBIN|哈工大|哈尔滨工业", name)) {
        target_name <- "Harbin Institute of Technology"
    } else if (grepl("BUPT|北邮|北京邮电", name)) {
        target_name <- "Beijing University of Post and Telecommunications"
    } else {
        target_name <- NA
    }
    
    return (target_name)
}

# Aggregate the disparate spellings of company and university names. 
agg_companies <- rep(NA, nrow(company_info))
agg_education <- rep(NA, nrow(company_info))
for (i in 1:nrow(company_info)) {
    agg_companies[i] <- company_aggregation(company_info$company[i])
    agg_education[i] <- education_aggregation(company_info$company[i])
}
company_info_agg <- cbind(company_info, agg_companies, agg_education)
```
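
The `if`/`else` chains above can also be written as an ordered lookup table, which removes the explicit loop and makes the pattern order visible at a glance. The following is a minimal base-R sketch: the helper name `aggregate_names` is hypothetical, and the pattern list is a shortened, ASCII-only subset chosen for illustration (the Chinese-language alternatives are omitted for brevity):

```r
# Ordered lookup table: canonical name -> regular expression (subset only).
company_patterns <- c(
    Baidu   = "BAIDU",
    Tencent = "TENCENT",
    Alibaba = "ALIBABA|TAOBAO|ALIPAY"
)

# Hypothetical helper: return the first matching canonical name for each
# input, or `default` (the original string unless overridden) otherwise.
aggregate_names <- function(x, patterns, default = x) {
    result <- rep_len(default, length(x))
    matched <- rep(FALSE, length(x))
    upper <- toupper(x)
    for (canonical in names(patterns)) {
        # Only fill entries that no earlier pattern has claimed.
        hit <- !matched & grepl(patterns[[canonical]], upper)
        result[hit] <- canonical
        matched <- matched | hit
    }
    result
}

aggregate_names(c("baidu.com", "@Tencent", "Alibaba Group", "Example Corp"),
                company_patterns)
```

Passing `default = NA` reproduces the behaviour of `education_aggregation`, which returns `NA` when no university pattern matches.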


```{r}
# Show the top ten companies with the most developers supporting 996.icu. 
# Sum the per-spelling developer counts; n() would only count distinct spellings. 
company_info_agg %>% group_by(agg_companies) %>%
    summarise(count = sum(count)) %>%
    arrange(desc(count)) %>% 
    head(10)

# Show which universities those developers come from. 
company_info_agg %>% group_by(agg_education) %>%
    summarise(count = sum(count)) %>%
    arrange(desc(count)) %>%
    filter(!is.na(agg_education)) %>%
    head(10)
```

### 2. What cities are those developers from? 


```{r}
# Define the function for aggregating the cities. 
city_aggregation <- function(name) {
    # Make case insensitive. 
    orig_name <- name
    name <- toupper(name)
    # Detect the pattern and map the location to a canonical city name. 
    # Guangzhou is tested before Hangzhou because "GUANGZHOU" also contains "ANGZHOU". 
    if (grepl("EIJING|北京", name)) {
        target_name <- "Beijing"
    } else if (grepl("HANGHAI|上海", name)) {
        target_name <- "Shanghai"
    } else if (grepl("UANGZHOU|广州", name)) {
        target_name <- "Guangzhou"
    } else if (grepl("ANGZHOU|杭州", name)) {
        target_name <- "Hangzhou"
    } else if (grepl("HENGDU|成都", name)) {
        target_name <- "Chengdu"
    } else if (grepl("ANJING|南京", name)) {
        target_name <- "Nanjing"
    } else if (grepl("INGAPORE|新加坡", name)) {
        target_name <- "Singapore"
    } else if (grepl("HONG KONG|香港|HK", name)) {
        target_name <- "Hong Kong"
    } else if (grepl("UHAN|武汉", name)) {
        target_name <- "Wuhan"
    } else {
        target_name <- orig_name
    }
    
    return (target_name)
}

city_info <- dt_user_cld %>% 
    group_by(location) %>%
    summarise(count = n()) %>%
    filter(location != "", 
           location != "China") %>%
    arrange(desc(count))

agg_cities <- rep(NA, nrow(city_info))
for (i in 1:nrow(city_info)) {
    agg_cities[i] <- city_aggregation(city_info$location[i])
} 

city_info_agg <- cbind(city_info, agg_cities)
```

```{r}
# Show the top ten cities with the most developers supporting 996.icu. 
# Sum the per-spelling developer counts; n() would only count distinct spellings. 
city_info_agg %>% group_by(agg_cities) %>%
    summarise(count = sum(count)) %>%
    filter(agg_cities != "", 
           agg_cities != "China") %>%
    arrange(desc(count)) %>%
    head(10)
```
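
Because `grepl()` matches substrings, a short pattern can match more locations than intended, and the order of the branches in `city_aggregation` matters. The sketch below illustrates the effect on three city names and shows word-boundary patterns as one possible safeguard; the `\\b` patterns are a suggestion and are not what the analysis above uses:

```r
locations <- toupper(c("Hangzhou", "Guangzhou, China", "Changzhou"))

# An unanchored pattern matches all three cities.
grepl("ANGZHOU", locations)
# TRUE TRUE TRUE

# A leading word boundary restricts the match to names that start with the pattern.
grepl("\\bHANGZHOU", locations)
# TRUE FALSE FALSE
grepl("\\bGUANGZHOU", locations)
# FALSE TRUE FALSE
```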

### 3. Summary Statistics of Supporters' Related Information. 

```{r}
# Summary statistics of supporters' GitHub accounts. 
dt_user_cld %>% select(followers, following, public_repos) %>%
    gather(stat_type, number, followers, following, public_repos) %>%
#    filter(number <= 10000) %>%
    # log1p() keeps zero counts, which log() would turn into -Inf. 
    ggplot(aes(x = factor(stat_type), y = log1p(number))) +
    geom_jitter(color = "grey", width = .2) +
    geom_boxplot(alpha=0.6) +
    stat_summary(fun.y = "mean", geom = "point", size = 5, color = "red", shape = 15) +
    labs(x = "Followers, Following and Public Repos (mean in red)", 
         y = "log(1 + count)") +
    ggtitle("Summary Statistics Plot (Log Transformed)")
```

### 4. Distribution Plot of Supporters' Information. 

```{r}
# Distribution of supporters' followers, following and public repositories (values up to 50 only). 
dist_ggplot <- dt_user_cld %>% filter(followers <= 50, following <= 50, public_repos <= 50) %>%
    ggplot() +
    geom_bar(aes(x = followers), col="black", fill="black", alpha=0.5) +
    geom_bar(aes(x = following), col="black", fill="red", alpha=0.5) +
    geom_bar(aes(x = public_repos), col="black", fill="blue", alpha=0.5)
    
dist_ggplot + 
    labs(x = "Followers (Black), Following (Red) and Public Repositories (Blue)", 
         y = "Count") +
    ggtitle("Distribution Plot")
```

### 5. Distribution Plot of Supporters' Registration Duration. 

```{r}
# Compute the number of days since each supporter registered their GitHub account. 
today <- lubridate::ymd("2019-04-29")
dt_user_cld <- dt_user_cld %>%
    # Calculate the duration and convert it to numerical value. 
    mutate(duration = as.numeric(as.duration(interval(created_at, today)), "days"))

# Showing average registration years. 
print(mean(dt_user_cld$duration)/365)

# Distribution plot of registration days. 
dt_user_cld %>% ggplot(aes(x = duration/365)) + 
    geom_histogram(col="black", fill="grey", alpha = 0.7) +
    geom_vline(xintercept = mean(dt_user_cld$duration)/365, linetype = "dotted", color = "red", size = 1.5) +
    labs(y = "Frequency / Count", 
         x = "Number of Years Since Registration") +
    ggtitle("Distribution Plot of Supporters' Registration Duration")
```

## 2. Statistical Modeling

### 1. Analyzing the Relationship Between Followers and other factors. 

```{r}
# Select variables for analysis. 
user_stat <- dt_user_cld %>% 
    select(followers, following, public_repos, duration)

# Sanity Check
user_stat %>% head(10)
```


```{r}
# Unsupervised Learning: PCA
user_pca <- prcomp(user_stat, center=TRUE, scale.=TRUE)
print(user_pca)
summary(user_pca)
```
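
The "Proportion of Variance" row that `summary()` prints for a PCA is each component's squared standard deviation divided by the total variance. A minimal sketch of that calculation, run on R's built-in `USArrests` dataset (used here only because it ships with R; it is not the project data):

```r
# PCA on a built-in dataset, centered and scaled as in the chunk above.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variance explained by each component, computed by hand.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 4)

# The same quantities as reported by summary().
summary(pca)$importance["Proportion of Variance", ]
```

When the first component explains only about a third of the variance, as in the PCA of `user_stat`, the four variables are not well summarised by a single underlying dimension.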

```{r}
# Supervised Learning: regression
# y = dt_user_cld$followers
# x = dt_user_cld$following, public_repos, duration

# Store the fitted model as `fit` to avoid reusing the name of the lm() function.
fit <- lm(followers ~ following+public_repos+duration, data = dt_user_cld)
summary(fit)
# Because the adjusted R-squared is very low (0.01508), we check the linear-model assumptions with diagnostic plots.

plot(fit)
# The diagnostic plots show that the data violate several linear-regression assumptions, so a linear model is not suitable for this dataset.
```
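
Follower counts are non-negative and heavily right-skewed, which is one reason an ordinary linear model fits poorly. One possible follow-up, sketched here on simulated data rather than the project data, is a count model such as a quasi-Poisson GLM. The column names mirror `dt_user_cld`, but every number in the simulation is made up for illustration, and whether this model suits the real data would still have to be checked:

```r
set.seed(1)
n <- 500

# Simulated predictors (hypothetical values, for illustration only).
sim <- data.frame(
    following    = rpois(n, lambda = 20),
    public_repos = rpois(n, lambda = 15),
    duration     = runif(n, min = 30, max = 3000)
)

# Simulated overdispersed follower counts whose log-mean depends on the predictors.
mu <- exp(0.5 + 0.02 * sim$following + 0.01 * sim$public_repos +
          0.0005 * sim$duration)
sim$followers <- rnbinom(n, size = 0.8, mu = mu)

# Quasi-Poisson regression: log link, variance proportional to the mean.
fit_qp <- glm(followers ~ following + public_repos + duration,
              family = quasipoisson(link = "log"), data = sim)
summary(fit_qp)
```

The coefficients of such a model are on the log scale, so `exp(coef(fit_qp))` gives multiplicative effects on the expected follower count.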



## 3. Main Questions/Issues of Supporters

### 1. Trending Issues. 

```{r}
# Showing top ten issues with most comments. 
dt_issues_cld %>% 
    select(title, comments) %>% 
    arrange(desc(comments)) %>%
    head(10)

# Pseudo top ten issues (titles translated into English). 
pseudo_issues <- data.frame(
    title = c("Discussion Thread", 
              "Any 'Working 996, sick in ICU' wallpapers to use?", 
              "Afterwards, I could put 'participated in an open-source project with 2000+ stars' on my resume", 
              "I don't understand the law, but I'm wondering if there is any legal issue involved?", 
              "Can this repository be in the top-ten stars list on GitHub?", 
              "Substantial suggestions regarding the anti-996 movement.", 
              "It's ugly that developers take their salaries while complaining about their companies.", 
              "Working overtime tonight, will delete the database when this repo reaches 100k stars", 
              "Cute girl born in 1996 is looking for a developer boyfriend now.", 
              "Worship the original post"), 
    comments = c(1243, 62, 53, 39, 37, 30, 30, 26, 25, 24)) %>%
    mutate(title = as.character(title))

pseudo_issues
```

### 2. Text Analysis About Topics of 996 Movement.

```{r}
# Setting the word-split engine
splitter <- worker(stop_word = "data/stopwords.txt")

# Splitting the words. 
seg <- c(splitter[dt_issues_cld$title], splitter[dt_issues_cld$body])
seg <- seg[nchar(seg) > 1]
# Mark the Chinese word vector as UTF-8 encoded. 
Encoding(seg) <- "UTF-8"
# Extract the top 100 
seg_df <- data.frame(seg = seg) %>%
    group_by(seg) %>%
    summarise(freq = n()) %>%
    arrange(desc(freq)) %>%
    head(100)

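# Generating word cloud (Chinese Version). 
# Switch to a font with Chinese glyphs, then restore the previous setting. 
font_family <- par("family")
par(family = "Adobe Heiti Std R")
wordcloud(words=seg_df$seg, freq=seg_df$freq, 
          colors=brewer.pal(8,"Dark2"), 
          scale=c(4, 0.8))
par(family = font_family)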
```
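
The counting step is independent of the segmenter: once the text has been split into tokens, the top terms are simply a sorted frequency table. A minimal base-R sketch on a hand-made token vector (the tokens are invented for illustration and stand in for the output of `splitter`):

```r
# Hypothetical tokens standing in for the segmenter output.
tokens <- c("overtime", "law", "overtime", "996", "salary", "law", "overtime", "a")

# Drop single-character tokens, as in the chunk above.
tokens <- tokens[nchar(tokens) > 1]

# Count occurrences and sort from most to least frequent.
freq <- sort(table(tokens), decreasing = TRUE)
top_terms <- data.frame(seg = names(freq), freq = as.integer(freq),
                        stringsAsFactors = FALSE)
head(top_terms, 3)
```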


```{r}
# Loading translated dataset. 
trans <- read.table("data/translate", sep="\t")[-1,]
seg_df <- cbind(seg_df, engl = trans$V2)
```

```{r warning=FALSE}
# Generating word cloud (English Version)
wordcloud(words=seg_df$engl, freq=seg_df$freq, 
          colors=brewer.pal(8,"Dark2"), 
          scale=c(4, 0.8))
```


# IV - References
#### Kenton, Will. “Social Good.” Investopedia, Investopedia, 12 Mar. 2019, www.investopedia.com/terms/s/social_good.asp.

#### Wei, Shiyang. "A pilot study on the Chinese internet environment." International Conference on Advances in Education and Management. Springer, Berlin, Heidelberg, 2011.

#### Yuan, Li. “Learning China's Forbidden History, So They Can Censor It.” The New York Times, The New York Times, 2 Jan. 2019, www.nytimes.com/2019/01/02/business/china-internet-censor.html.

#### “Chinese Developers Use Github to Protest against Country's 996 Work Schedule.” South China Morning Post, 29 Mar. 2019, www.scmp.com/tech/start-ups/article/3003691/developers-lives-matter-chinese-software-engineers-use-github.
